runZinb <- T
runClus <- T
NCORES <- 7

EAP: small request. Can everyone put a line between the beginning of a r chunk and text? It makes it nicely formated for my text editor.

EAP: if we are going to ultimately use these functions, it would be good to add them to the package for zinbwave.

FP: yes, I agree. Davide, we wait for zinbwave to be on bioconductor before to push changes, right?

compute.zinb.loglik <- function(Y, zinb){
  mu = t(getMu(zinb)) 
  theta = getTheta(zinb)
  theta_mat = matrix(rep(theta, ncol(Y), ncol = ncol(Y)))
  pi = t(getPi(zinb))
  log( pi * (Y == 0) + (1 - pi) * dnbinom(Y, size = theta, mu = mu) )
}

compute.deviance.residuals <- function(Y, zinb){
  mu_hat = t(getMu(zinb)) 
  pi_hat = t(getPi(zinb)) 
  Y_hat = (1 - pi_hat) * mu_hat
  ll = compute.zinb.loglik(Y, zinb)
  sign = 1*(Y - Y_hat > 0)
  sign[sign == 0] = -1
  sign * sqrt(-2 * ll)
}

Steps of the workflow

We propose a worklow to analyze single cell RNA-Seq with the following steps

  • Dimensionality reduction using zinbwave to get W which should capture the biology,
  • Cluster cells using clusterExperiment on W to get the cluster labels,
  • Get lineage using slingshot on W and cluster labels from clusterExperiment,
  • Get DE genes between lineages/clusters.

Along the worflow, use deviance residuals as adjusted values.

knitr::include_graphics('../vignettes/workflow.png')
Worflow to analyze single cell RNASeq data

Worflow to analyze single cell RNASeq data

1. Create a SummarizedExperiment object

Along the workflow, we want to use a unique SummarizedExperiement object carrying all the data we need.

EAP: I have updated the code to pull from a dataset on the repos that is created with the createData.R file. For now, I am filtering to the top 1000 most variable genes there, though we might want to add that to the code for the article. This will be slightly different data from Russell’s, which didn’t use all of the samples. We can adjust that decision later, or just compare the samples that are the same. (Russell’s clusterLabels are in the meta data)

EAP: zinbFit doesn’t accept data.frame objects, so currently have to have a data.matrix command. Should it be changed so that it does?

#counts<-read.table("../data/oeCufflinkCountData.txt",sep="\t",header=TRUE)
core <- read.table("../data/oeCufflinkCountData_1000Var.txt",
                   sep = "\t", header = TRUE)
core <- data.matrix(core)
metadata <- read.table("../data/oeMetadata.txt", sep = "\t", header = TRUE)
# symbol for samples missing from original clustering
metadata$clusterLabels[is.na(metadata$clusterLabels)] <- -2

Here we only look at the 1000 most variable genes. EAP: see note above, I’ve commented out the filtering and added it to the createData.R for now.

batch <- metadata$Batch

Cells have been processed in 18 different batches

col_batch = rep(brewer.pal(9, "Set1"), 2)
names(col_batch) = unique(batch)
table(batch)
## batch
## GBC08A GBC08B GBC09A GBC09B    P01    P02   P03A   P03B    P04    P05 
##     41     47     43     38     34     49     73     52     24     23 
##    P06    P10    P11    P12    P13    P14    Y01    Y04 
##     53     51     54     51     60     49     65     42

We have qc measures from the data

qc <- metadata[, !names(metadata) %in% c("Batch", "Experiment", "clusterLabels")]
head(qc, 2)
##                  NREADS NALIGNED  RALIGN TOTAL_DUP    PRIMER
## OEP01_N706_S501 3313260  3167600 95.6035   47.9943 0.0154566
## OEP01_N701_S501 2902430  2757790 95.0167   45.0150 0.0182066
##                 PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES
## OEP01_N706_S501               2e-06         0.200130      0.230654
## OEP01_N701_S501               0e+00         0.182461      0.201810
##                 PCT_INTRONIC_BASES PCT_INTERGENIC_BASES PCT_MRNA_BASES
## OEP01_N706_S501           0.404205             0.165009       0.430784
## OEP01_N701_S501           0.465702             0.150027       0.384271
##                 MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS
## OEP01_N706_S501           0.843857           0.061028           0.521079
## OEP01_N701_S501           0.914370           0.033350           0.373993
##                 CreER ERCC_reads
## OEP01_N706_S501     1      10516
## OEP01_N701_S501  3022       9331
clus.labels <- metadata[, "clusterLabels"]

In original work (FP: add ref), cells have been clustered into 14 different clusters

col_clus <- c("transparent", brewer.pal(12, "Set3"), brewer.pal(8, "Set2"))
col_clus <- col_clus[1:length(unique(clus.labels))]
names(col_clus) <- sort(unique(clus.labels))
table(clus.labels)
## clus.labels
##  -2   1   2   3   4   5   7   8   9  10  11  12  14  15 
## 233  91  25  56  40  96  60  28  79  26  22  35  26  32

Batches are kind of confounded with the biology

table(data.frame(batch = as.vector(batch),
                 cluster = clus.labels))
##         cluster
## batch    -2  1  2  3  4  5  7  8  9 10 11 12 14 15
##   GBC08A  5  0  2 12  9  0  0  0  0  0  2  0  2  9
##   GBC08B 14  0  7  5  3  0  0  0  1  2  4  0  5  6
##   GBC09A 13  0  1  5  9  0  0  0  1  1  0  0  6  7
##   GBC09B 21  0  2  2  7  0  0  0  3  0  0  0  3  0
##   P01     9  0  2  4  3 15  1  0  0  0  0  0  0  0
##   P02     6  2  0  9  3 15  3  3  2  3  0  2  1  0
##   P03A   36  3  0  3  0 12  2  9  4  2  0  2  0  0
##   P03B   19  1  2  1  1 11  1  2 10  1  1  2  0  0
##   P04    10  0  0  0  0 11  1  0  1  1  0  0  0  0
##   P05     3  0  0  0  1 11  3  0  1  0  2  2  0  0
##   P06    14  1  2  3  0  8  2  4  8  4  1  2  2  2
##   P10    15  3  1  4  0  4  5  9  2  0  2  5  0  1
##   P11    10  2  1  1  0  1  5  1 22  3  1  6  0  1
##   P12    11  0  2  0  0  4 10  0  8  2  3  6  4  1
##   P13    13  1  2  4  0  4 15  0  4  5  6  1  3  2
##   P14    10  0  0  1  2  0 12  0 12  2  0  7  0  3
##   Y01    14 47  1  1  2  0  0  0  0  0  0  0  0  0
##   Y04    10 31  0  1  0  0  0  0  0  0  0  0  0  0

We have 849 cells.

dim(core)
## [1] 1000  849
core[1:3, 1:3]
##        OEP01_N706_S501 OEP01_N701_S501 OEP01_N707_S507
## Cbr2              5799            3638            1448
## Cyp2f2            2158            2027            1078
## Gstm1             8763            7221            3581

Let’s create a SummarizedExperiment object to store the raw counts and information about the data, that is batches, original labels, and quality control measures.

se <- SummarizedExperiment(assays = list(rawCounts = core),
                           colData = metadata)

2. Dimensionality reduction adjusting for gene and cell-level covariates

To cluster and get lineages we want to reduce the dimension of the data. We are going to use zinbwave to do so. First, let’s fit zinbwave with first K = 0 to compute normalized values (i.e. deviance residuals) adjusted for batches. We could also adjust for gene length or GC content here. We then fit zinbwave to get the dimensionality reduced matrix W specifying the number of dimension K = 50. Eventually, we will call zinbwave just once where we would have an argument in zinbFit like “compute_normalized_values” in c(TRUE, FALSE). For K = 0 and K = 50, we correct for batch effect including batches in X.

fn0 <- '../data/zinb_k0_batch.rda'
if (runZinb & !file.exists(fn0)){
  mod <- model.matrix( ~ batch)
  print(system.time(zinb0 <- zinbFit(core, ncores = NCORES,
                                     K = 0, X = mod)))
  save(zinb0, file = fn0)
}else{
  load(fn0)
}

fn50 <- '../data/zinb_k50_batch.rda'
if (runZinb & !file.exists(fn50)){
  mod <- model.matrix( ~ batch)
  print(system.time(zinb50 <- zinbFit(core, ncores = NCORES,
                                      K = 50, X = mod)))
  save(zinb50, file = fn50)
}else{
  load(fn50)
}

Normalized values

We use deviance residuals as normalized values for visualization. FP: explain rational: K=0 so residuals capture the bio adjusting for batch. Let’s check that deviance residuals look ok.

res <- compute.deviance.residuals(core, zinb0)
res[1:3,1:3]
##        OEP01_N706_S501 OEP01_N701_S501 OEP01_N707_S507
## Cbr2          4.541942       -4.431585       -4.239279
## Cyp2f2        4.316532        4.301158       -4.146744
## Gstm1         4.629884        4.583829       -4.429794

Boxplot of the normalized values for each cell. It seems that correction for batches is ok.

res_order <- res[, order(as.numeric(batch))]
col_order <- as.numeric(batch)[order(as.numeric(batch))]
boxplot(res_order, main='Boxplot of normalized values\ncolor=batch',
        col = col_order, staplewex = 0, outline = 0, border = col_order, xaxt = 'n')

PCA on the normalized values where color are for batches on the left and previously found clusters on the right. We want no clustering on the left side and clustering on the right side.

pca <- prcomp(t(res))
par(mfrow = c(1,2))
plot(pca$x, col = col_batch[batch], pch = 20,
     main="PCA of normalized values\ncolor=batch")
plot(pca$x, col = col_clus[as.character(clus.labels)], pch = 20,
     main = "PCA of normalized values\ncolor=cluster")

par(mfrow = c(1,1))

Let’s add the normalized values as an assay dataset in our SummarizedExperiment object.

assays(se)[['normalizedValues']] <- res

Dimensionality reduction

Let’s check that performing MDS on W we have something coherent with original clusters.

W <- getW(zinb50)
d <- dist(W)
fit <- cmdscale(d, eig = TRUE, k = 2)
plot(fit$points, col = col_clus[as.character(clus.labels)], main = 'MDS', pch = 20,
     xlab = 'Component 1', ylab = 'Component 2')
legend(x = 'bottomright', legend = unique(names(col_clus)), cex = .5,
       fill = unique(col_clus), title = 'Sample')

Let’s add W to the colData of our SummarizedExperiment object.

W <- data.frame(W)
colnames(W) <- paste0('W', 1:ncol(W))
colData(se) <- cbind(colData(se), W)

FP: what do you think of implemented a “compute_normalized_values” argument to zinbFit for summarizedExperiment objects where zinbFit would take as input a SummarizedExperiment object and return a summarizedExperiment object with slots added for normalized values and W?

2. Clustering of the cells

We use clusterExperiment with W.

FP: What about the following call to RSEC?

EP: I updated it to work on a SE object so that it has the meta data. If you have a summarized experiment object with W already, you could use that as long as assay(seObj) gives W.

W <- colData(se)[, grepl('^W', colnames(colData(se)))]
W <- as.matrix(W)
fn <- '../data/RSEC_W.rda'
if (runClus & !file.exists(fn)){
  #symbol for samples missing from original clustering
    metadata$clusterLabels[is.na(metadata$clusterLabels)]<- -2 
  seObj<-SummarizedExperiment(t(W),colData=metadata)
  print(system.time(ceObj <- RSEC(seObj, k0s = 4:15, alphas = c(0.1),
                                  betas = 0.8,
                clusterFunction = "hierarchical01", minSizes=1,
                ncores = NCORES, isCount=FALSE,
                subsampleArgs = list(resamp.num=100,
                                     clusterFunction="kmeans",
                                     clusterArgs=list(nstart=10)),
                seqArgs = list(k.min=3, top.can=5), verbose=TRUE,
                combineProportion = 0.7,
                mergeMethod = "none")))
  save(ceObj, file = fn)
}else{
  load(fn)
}
plotClusters(ceObj, colPalette = c(bigPalette, rainbow(199)))

plotCoClustering(ceObj)
## Warning in .makeColors(clusters, colors = bigPalette): too many clusters to
## have unique color assignments

table(primaryClusterNamed(ceObj))
## 
##  -1  c1  c2  c3  c4  c5  c6  c7  c8  c9 
## 316 145   5   8  94  90 105  44   7  35
sum(primaryCluster(ceObj) == -1)
## [1] 316

FP: Elizabeth, we are working with the W here, does the locfdr make sense in this context? I set eval=FALSE in the next chunk to skip the merging step, let me know if you would rather keep using it. And if we want to still use the merging step, would we want to include it in RSEC function arguments instead of separately?

EP: I don’t think the merging step on the W makes a whole lot of sense – the method is irrelevant. The merging is based on calculating the % of genes found significant (the specific method is arbitrary). The best thing would be to replace the W with residuals in the assay of ceObj (or whatever data that you will do the DE on for the time stuff below), and then run the merging step on that data. I’m not particularly fond of locfdr. It was probably the method that gave the best merging to Russell and Diya. You’d really have to run mergeClusters setting plotInfo="all" and look at the results and decide both the cutoff level and the method.

EP: Also, if you don’t save the output of mergeClusters it doesn’t update ceObj. I was calling it for just the resulting plots, since it was already merged in RSEC above. I’ve changed to code to update ceObj below.

FP: Ha ok, good to know. I’ll keep the eval=FALSE for the moment.

#re-does merging simpling to make plot 
#something like:
#assay(ceObj)
# if that replacement data should be considered on the transformed scale in plots, etc, the transformation function should be fixed as well:
#transformation(ceObj)
ceObj<-mergeClusters(ceObj, mergeMethod = "locfdr",
              plotInfo = "mergeMethod", cutoff = 0.01)

FP: Not sure we should show the following heatmap using the W, we should rather use the normalized values (following heatmap).

plotHeatmap(ceObj, clusterSamplesData = "dendrogramValue", 
            breaks = .99)

So, let’s look at a heatmap on normalized values.

FP: Elizabeth, I did not find how to define the column annotation track in the plot below to have the same colors as in ceObj@clusterLegend[[1]]. I tried to use arguments annColors and annCol from aheatmap as it is said in plotHeatmap documentation that for signature matrix arguments can be passed to aheatmap. But I got the error “The following arguments to aheatmap cannot be set by the user in plotHeatmap:Rowv,Colv,color,annCol,annColors”.

EP: Fanny, you would need to use the argument ‘clusterLegend’. That argument takes either the format of aheatmap (list with each element a named vector of colors) or the format of the clusterExperiment object (i.e. list with each element a matrix with columns for name and color). So I think the following code will run, though it might need the list to have names…

But an easier fix to the code would be to set visualizeData option. I haven’t tested this because I don’t have the objects need run, so let me know if there is error.

FP: it seems great to me, what do you think?

# sampleData <- data.frame(ours = primaryCluster(ceObj))
# plotHeatmap(assays(se)$normalizedValues,
#             main = 'Normalized values, 1000 most variable genes',
#             clusterSamplesData = ceObj@dendro_samples,
#             sampleData = as.matrix(sampleData),clusterLegend=ceObj@clusterLegend[1])
# easier fix:
plotHeatmap(ceObj, visualizeData = assays(se)$normalizedValues,
            whichClusters = "primary",
            clusterSamplesData = "dendrogramValue",
            main = 'Normalized values, 1000 most variable genes',
            breaks = 0.99)

plot(fit$points, col = col_clus[as.character(clus.labels)],
     main = 'MDS W, color = original clusters', pch = 20,
     xlab = 'Component1', ylab = 'Component2')
legend(x = 'bottomright', legend = unique(names(col_clus)), cex = .5,
       fill = unique(col_clus), title = 'Sample')

palDF <- ceObj@clusterLegend[[1]]
pal <- palDF[, 'color']
names(pal) <- palDF[, 'name']
pal["-1"] = "transparent"
plot(fit$points, col = pal[primaryClusterNamed(ceObj)],
     main = 'MDS W, color = our new clusters', pch = 20,
     xlab = 'Component1', ylab = 'Component2')
legend(x = 'bottomright', legend = names(pal), cex = .5,
       fill = pal, title = 'Sample')

4. Pseudotime ordering

The goal of this section is to see if we need to refit zinbwave when we want to run slingshot. We first run slingshot on the W used by clusterExperiment. In the second part of this section, we fit zinbwave on the matrix of counts where the unassigned cells have been removed. For each part (without or with refitting zinbwave), we run slingshot in the supervised and unsupervised mode and try k=3, k=4, k=5 dimensions in W.

From what I understand, start original clusters are 1 and 5 (HBC) and end original clusters are 15 (Microvillus), 9 and 12 (neuron), and 4, 7 (Sus). Additionally, we want the GBC cluster to be a junction before the differentiation between Microvillus and Neuron. The correspondance with the original clusters is as follow

table(data.frame(original = clus.labels, ours = primaryClusterNamed(ceObj)))
##         ours
## original  -1  c1  c2  c3  c4  c5  c6  c7  c8  c9
##       -2 119  56   2   4   7  26   8   1   5   5
##       1   48  43   0   0   0   0   0   0   0   0
##       2    3   0   0   0  22   0   0   0   0   0
##       3    3   4   0   0  49   0   0   0   0   0
##       4   12   2   0   3   0  23   0   0   0   0
##       5   55  38   3   0   0   0   0   0   0   0
##       7   18   0   0   1   0  41   0   0   0   0
##       8   26   2   0   0   0   0   0   0   0   0
##       9   16   0   0   0   0   0  63   0   0   0
##       10   1   0   0   0   0   0   0  25   0   0
##       11   7   0   0   0  15   0   0   0   0   0
##       12   1   0   0   0   0   0  34   0   0   0
##       14   5   0   0   0   1   0   0  18   2   0
##       15   2   0   0   0   0   0   0   0   0  30
Cluster name Description Correspondence Color
c1 HBC original 1, 5 darkblue
c2 new and small new and small green
c3 new and small new and small red
c4 GBC / immature neurons / MV 1 original 2, 3, 11, 14 orange
c5 Sus original 4, 7 darkpurple
c6 Neuron original 9, 12 brown
c7 Immature Neuron original 10, 14 cian
c8 Immature Neuron original 14 lightpurple
c9 Microvillus original 15 grayblue

FP: We need to work on the cluster coloring (!) and the matching with the clusters you found for the paper with Russell, maybe by changing the parameters in clusterMany?

EP: The function plotClusters aligns the clusters and can return the aligned coloring between clusterings. There is an option to return an updated CE object that changes the colors in clusterLegend to match the new colors found by aligning the clusters (resetNames and resetColors arguments). These haven’t been extensively tested in real life because I don’t think we’ve used them much. But it would be good to demonstrate!

FP: With all the improvements made, I think we simply have the correct colors now.

Kvec <- c(3, 4, 5)

Use previous W

The input of slingshot is the W used for clusterExperiment where the number of dimensions is reduced to k where k in (3, 4, 5) here.

Unsupervised

K = 3 does not seem very good to me: Sus is not an end cluster, GBC is an end cluster.

K = 4 is better, slingshot finds the end clusters but there is a spurious end cluster.

K = 5 does not seem great to me: GBC is an end cluster and Sus and Microvillus are in the same lineage.

our_cl <- primaryClusterNamed(ceObj)
cl = our_cl[our_cl != "-1"]
pal = pal[names(pal) != '-1']
for (k in Kvec){
  X <- W[our_cl != "-1", 1:k]

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1")
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c2" "c3" "c5" "c9"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4"
## NULL
## NULL

## [1] "K=4"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4" "c9"
## [1] "c1" "c3" "c5"
## [1] "c1" "c2"
## NULL

## [1] "K=5"
## [1] "c1" "c2" "c3" "c5" "c9"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4"
## NULL
## NULL

Supervised

K = 3 finds GBC as an end cluster (that I did not specify in the end.clus!).

K = 4 Yeah! it seems that it is what we want even if we still have a spurius end cluster and GBC not really at the junction.

K = 5 Yeah! even if GBC not really at the junction.

for (k in Kvec){
  X <- W[our_cl != "-1", 1:k]

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1",
                           end.clus = c("c5", "c6", "c9"))
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4"
## [1] "c1" "c8" "c9"
## NULL

## [1] "K=4"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4" "c9"
## [1] "c1" "c3" "c5"
## [1] "c1" "c2"
## NULL

## [1] "K=5"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4" "c9"
## NULL
## NULL

Re-fitting zinbwave

Unsupervised

K = 3 is better than when we did not refit zinbwave but still not perfect: Sus in all the clusters. GBC not really at the junction.

K = 4 good even if GBC not really at the junction.

K = 5 not great GBC is an end cluster.

mod <- model.matrix(~ batch[our_cl != "-1"])
fn <- '../data/refit_zinbwave_slingshot.rda'
if (runZinb & !file.exists(fn)){
  zinbList <- lapply(Kvec, function(k){
    zinbFit(core[, our_cl != "-1"], X = mod, K = k, ncores = NCORES)
  })
  save(zinbList, file = fn)
}else{
  load(fn)
}
for(k in Kvec) {
  X <- getW(zinbList[[k - 2]])[, 1:k]

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1")
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c5" "c8" "c7" "c6"
## [1] "c1" "c5" "c8" "c4" "c9"
## [1] "c1" "c5" "c2"
## [1] "c1" "c5" "c3"
## NULL

## [1] "K=4"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4" "c9"
## NULL
## NULL

## [1] "K=5"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c8" "c4"
## [1] "c1" "c8" "c6"
## [1] "c1" "c8" "c7"
## [1] "c1" "c8" "c9"

Supervised

K = 3 Yeah! close to perfection.

K = 4 good even if GBC not really at the junction.

K = 5 no, GBC is an end cluster.

for(k in Kvec){
  X <- getW(zinbList[[k - 2]])[, 1:k]

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1",
                           end.clus = c("c5", "c6", "c9"))
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c4" "c8" "c7" "c6"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c4" "c9"
## NULL
## NULL

## [1] "K=4"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c8" "c7" "c6"
## [1] "c1" "c8" "c4" "c9"
## NULL
## NULL

## [1] "K=5"
## [1] "c1" "c2" "c3" "c5"
## [1] "c1" "c8" "c4"
## [1] "c1" "c8" "c6"
## [1] "c1" "c8" "c7"
## [1] "c1" "c8" "c9"

CONCLUSION: K = 5 is never great as GBC is generally an end cluster. K = 4 is ok for all the methods and a bit better when zinbwave is refitted. K = 3 when refitting and supervized is good.

It seems to me that using slingshot on W without re-fitting zinbwave with k = 4 gives good results where supervized mode is slightly better than unsupervized. It is just a one shot example and we should obviously not make a general conclusion, but I think that for the purpose of the workflow it is fine to use slingshot without refitting zinbwave. We should write a note to the user that it is better to refit zinbwave to have more power.

5. DE analysis

Here is the kind of plots we want to present

de <- read.csv('../data/oe_markers.txt', stringsAsFactors = F, header = F)
de <- de$V1
plotHeatmap(ceObj, 
            visualizeData = assays(se)$normalizedValues[rownames(se) %in% de, ],
            clusterSamplesData = "dendrogramValue",
            whichClusters = "primary",
            main = 'Normalized values, 1000 most variable genes',
            breaks = 0.99)

FP: Kelly, is it you who did the DE analysis for Russell paper? If yes, what tool did you use? On what data? The full quantile normalized counts? Do you have code available?

Session Info

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rARPACK_0.11-0               digest_0.6.12               
##  [3] RColorBrewer_1.1-2           Rtsne_0.13                  
##  [5] magrittr_1.5                 gplots_3.0.1                
##  [7] ggplot2_2.2.1                slingshot_0.0.3-5           
##  [9] princurve_1.1-12             zinbwave_0.1.4              
## [11] clusterExperiment_1.3.0-9008 scone_1.0.0                 
## [13] scRNAseq_1.2.0               SummarizedExperiment_1.6.3  
## [15] DelayedArray_0.2.4           matrixStats_0.52.2          
## [17] Biobase_2.36.2               GenomicRanges_1.28.3        
## [19] GenomeInfoDb_1.12.1          IRanges_2.10.2              
## [21] S4Vectors_0.14.2             BiocGenerics_0.22.0         
## 
## loaded via a namespace (and not attached):
##   [1] shinydashboard_0.6.0     R.utils_2.5.0           
##   [3] RSQLite_1.1-2            AnnotationDbi_1.38.0    
##   [5] htmlwidgets_0.8          grid_3.4.0              
##   [7] trimcluster_0.1-2        BiocParallel_1.10.1     
##   [9] RNeXML_2.0.7             DESeq_1.28.0            
##  [11] munsell_0.4.3            codetools_0.2-15        
##  [13] statmod_1.4.29           scran_1.4.4             
##  [15] DT_0.2                   miniUI_0.1.1            
##  [17] colorspace_1.3-2         energy_1.7-0            
##  [19] knitr_1.16               uuid_0.1-2              
##  [21] pspline_1.0-17           robustbase_0.92-7       
##  [23] bayesm_3.0-2             NMF_0.20.6              
##  [25] tximport_1.4.0           GenomeInfoDbData_0.99.0 
##  [27] hwriter_1.3.2            rhdf5_2.20.0            
##  [29] rprojroot_1.2            EDASeq_2.10.0           
##  [31] diptest_0.75-7           R6_2.2.1                
##  [33] doParallel_1.0.10        ggbeeswarm_0.5.3        
##  [35] taxize_0.8.4             locfit_1.5-9.1          
##  [37] flexmix_2.3-14           bitops_1.0-6            
##  [39] reshape_0.8.6            assertthat_0.2.0        
##  [41] scales_0.4.1             nnet_7.3-12             
##  [43] beeswarm_0.2.3           gtable_0.2.0            
##  [45] phylobase_0.8.4          RUVSeq_1.10.0           
##  [47] bold_0.4.0               rlang_0.1.1             
##  [49] genefilter_1.58.1        splines_3.4.0           
##  [51] rtracklayer_1.36.3       lazyeval_0.2.0          
##  [53] hexbin_1.27.1            rgl_0.98.1              
##  [55] yaml_2.1.14              reshape2_1.4.2          
##  [57] abind_1.4-5              GenomicFeatures_1.28.1  
##  [59] backports_1.1.0          httpuv_1.3.3            
##  [61] tensorA_0.36             tools_3.4.0             
##  [63] gridBase_0.4-7           stabledist_0.7-1        
##  [65] dynamicTreeCut_1.63-1    Rcpp_0.12.11            
##  [67] plyr_1.8.4               visNetwork_1.0.3        
##  [69] progress_1.1.2           zlibbioc_1.22.0         
##  [71] purrr_0.2.2.2            RCurl_1.95-4.8          
##  [73] prettyunits_1.0.2        viridis_0.4.0           
##  [75] zoo_1.8-0                cluster_2.0.6           
##  [77] data.table_1.10.4        RSpectra_0.12-0         
##  [79] mvtnorm_1.0-6            whisker_0.3-2           
##  [81] gsl_1.9-10.3             aroma.light_3.6.0       
##  [83] mime_0.5                 evaluate_0.10           
##  [85] xtable_1.8-2             XML_3.98-1.7            
##  [87] mclust_5.3               gridExtra_2.2.1         
##  [89] compiler_3.4.0           biomaRt_2.32.0          
##  [91] scater_1.4.0             tibble_1.3.3            
##  [93] KernSmooth_2.23-15       R.oo_1.21.0             
##  [95] htmltools_0.3.6          pcaPP_1.9-61            
##  [97] segmented_0.5-2.0        tidyr_0.6.3             
##  [99] geneplotter_1.54.0       howmany_0.3-1           
## [101] DBI_0.6-1                MASS_7.3-47             
## [103] fpc_2.1-10               MAST_1.2.1              
## [105] boot_1.3-19              compositions_1.40-1     
## [107] ShortRead_1.34.0         Matrix_1.2-10           
## [109] ade4_1.7-6               R.methodsS3_1.7.1       
## [111] gdata_2.17.0             igraph_1.0.1            
## [113] rncl_0.8.2               GenomicAlignments_1.12.1
## [115] registry_0.3             numDeriv_2016.8-1       
## [117] locfdr_1.1-8             plotly_4.7.0            
## [119] xml2_1.1.1               foreach_1.4.3           
## [121] annotate_1.54.0          vipor_0.4.5             
## [123] rngtools_1.2.4           pkgmaker_0.22           
## [125] XVector_0.16.0           stringr_1.2.0           
## [127] copula_0.999-16          ADGofTest_0.3           
## [129] softImpute_1.4           Biostrings_2.44.0       
## [131] rmarkdown_1.5            dendextend_1.5.2        
## [133] edgeR_3.18.1             kernlab_0.9-25          
## [135] shiny_1.0.3              Rsamtools_1.28.0        
## [137] gtools_3.5.0             modeltools_0.2-21       
## [139] rjson_0.2.15             nlme_3.1-131            
## [141] jsonlite_1.4             viridisLite_0.2.0       
## [143] limma_3.32.2             lattice_0.20-35         
## [145] httr_1.2.1               DEoptimR_1.0-8          
## [147] survival_2.41-3          FNN_1.1                 
## [149] prabclus_2.2-6           iterators_1.0.8         
## [151] glmnet_2.0-10            class_7.3-14            
## [153] stringi_1.1.5            mixtools_1.1.0          
## [155] latticeExtra_0.6-28      caTools_1.17.1          
## [157] memoise_1.1.0            dplyr_0.5.0             
## [159] ape_4.1